Exploratory Data Analysis

The first step in an analysis project is to explore the data. John W. Tukey1 established the area of exploratory data analysis over 50 years ago. It is a useful way to assess data cleanliness and to detect anomalies. Exploring the data will also facilitate the detection of patterns and relationships.2 This activity is done before data modeling and hypothesis testing. In fact, a good understanding of the data from this exploration will help to organize and develop effective and meaningful models.

Data Types

Variables in a model can be categorized into four types (often called data types): nominal, ordinal, interval, and ratio. The first two, nominal and ordinal, are often called categorical data and are frequently text data types. Nominal values are usually labels that differentiate one item from another but do not have a natural order or ranking, such as gender, zip codes, and product IDs. The categories of a nominal variable are mutually exclusive, and the only descriptive statistics we can calculate are counts and the mode. Ordinal values imply an order, sequence, or ranking, such as academic grades, earthquake magnitudes, and scales like low-medium-high. The two descriptive statistics we can calculate for ordinal values are the mode and the median.

The third and fourth data types are numeric and can be used in calculations. For interval values, the order of the values and the differences between them are meaningful, such as temperature, SAT scores, calendar dates, and latitudes, but there is no true zero value. The descriptive statistics we can calculate for interval values are the mean, median, mode, and standard deviation. A ratio value is similar to an interval value but also has a clear definition of zero, so the ratio of two values has a meaningful interpretation. Some ratio examples are height, weight, length, pulse, and age. A person cannot have a negative height or age. The descriptive statistics we can calculate for ratio values include the mean, median, mode, standard deviation, the ratio of the largest value to the smallest value, and many others.

It is very important to know the data types of your data to ensure you use appropriate methods and can interpret the results. For example, the mean and standard deviation of a list of zip codes are not meaningful. In analytics, we often simplify data types to two kinds: categorical and numeric. Nominal and ordinal are considered categorical data types, and interval and ratio are considered numeric data types. In statistics, these two kinds are also sometimes called qualitative and quantitative.

Figure 10.1 describes several characteristics of the different data types. As can be seen, the categorical variables do not support numeric calculations, but they do allow the mode (and, for ordinal data, the median) to be determined. The numeric variables also support the calculation of the mean and of differences between values.

Figure 10.1: Characteristics of Data Types

The difference between interval and ratio can be seen in the pH scale, which is interval and not ratio. A pH of 3 is not twice as acidic as a pH of 6; because the scale is logarithmic, a pH of 3 is actually 1,000 times more acidic. The range of pH values is shown in Figure 10.2.

Figure 10.2: Range of pH Values

Data of a higher data type can be converted to a lower data type. For example, the ratio data type Age can be converted to an ordinal data type {Infant, Adolescent, Adult, Senior}. The reverse is not possible, which is why it is a best practice to collect data in the highest, most detailed form possible, usually ratio or interval.
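
As a rough sketch of this kind of conversion, the pandas cut function can bin a ratio variable such as Age into ordered categories; the ages, cut points, and labels below are purely illustrative assumptions.

    import pandas as pd

    ages = pd.Series([2, 14, 35, 47, 71, 88])   # ratio data (illustrative)
    groups = pd.cut(
        ages,
        bins=[0, 12, 18, 65, 120],               # hypothetical cut points
        labels=["Infant", "Adolescent", "Adult", "Senior"],
    )
    print(groups)                                # ordinal categories

Note that the exact ages cannot be recovered from the binned categories, which is the reason for collecting the more detailed form in the first place.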

Measures of Central Tendency

Measures of Location

A measure of central tendency is a single value that attempts to describe the central position of a set of data. The measure we are most familiar with is the mean, often called the average, but there are several ways to think about the central position. The most common measures used to summarize the central location of the data are:

  • Mean — Sum of all values divided by the number of values (also called average), =AVERAGE(range) or =AVERAGE(number1, number2, …).

  • Median — The value such that one-half of the data lies above it and one-half lies below it (also called the 50th percentile), =MEDIAN(range) or =MEDIAN(number1, number2, …).

  • Mode — The value (or values, in case of a tie) that appears most often in the data, =MODE(range) or =MODE(number1, number2, …). If the data contain no duplicate values, MODE returns the #N/A error.

  • Outlier — A data value that is very different from most of the data (also called an extreme value).

The mean (or average) value is the most basic measure of location. One weakness of the mean is that it is influenced by extreme values. One remedy is to drop a fixed number of values at the upper and lower ends (often the top and bottom 10%) and then recalculate; the result is sometimes called a trimmed mean. A weighted average is another modification of the mean calculation: each data point is multiplied by a weight, the products are summed, and the sum is divided by the sum of the weights. This allows us to downweight or upweight certain values for special purposes.
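
As a small illustration of these measures, the sketch below computes the mean, median, trimmed mean, and a weighted average in Python; the tuition-like values and weights are made up for the example, and scipy's trim_mean is used for the trimming.

    import numpy as np
    from scipy import stats

    values = np.array([9200, 10400, 11800, 12500, 13300, 14100,
                       15800, 17200, 18900, 43500])   # illustrative data

    mean = values.mean()                     # like =AVERAGE(range)
    median = np.median(values)               # like =MEDIAN(range)
    trimmed = stats.trim_mean(values, 0.1)   # drop top and bottom 10%, then average

    weights = np.ones(len(values))
    weights[-1] = 0.25                       # hypothetical downweighting of the extreme value
    weighted = np.average(values, weights=weights)

    print(mean, median, trimmed, weighted)

The extreme value of 43,500 pulls the mean well above the median, while the trimmed and weighted versions reduce its influence.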

The median is the middle value of a sorted odd-numbered list and the average of the two middle values of a sorted even-numbered list. Because the median only uses the value(s) at the center of the data, it is more robust to outliers (it is not influenced by extreme values the way the mean is).

We can use the mean and median together as a rule of thumb to tell if a unimodal data distribution is skewed and which direction it is skewed. If the mean is right of the median, the distribution has a right or positive skew (long tail to the right and data bunched to the left), and if the mean is left of the median, the distribution has a left or negative skew (long tail to the left and data bunched to the right). (Note: This seems backwards to most people, but that is the way it is.) Figure 10.3 illustrates skewed distributions. This rule of thumb does not work for multimodal or discrete distributions.

Figure 10.3: Examples of Skewed Distributions

Image by Diva Jain, CC BY-SA 4.0 via Wikimedia Commons.
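
To see this rule of thumb in action, the short sketch below draws a right-skewed sample (from a lognormal distribution, chosen here only for illustration) and checks that its mean lies to the right of its median and that its skewness statistic is positive.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    data = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)   # long tail to the right

    print(np.mean(data) > np.median(data))   # True: the mean sits right of the median
    print(stats.skew(data) > 0)              # True: positive (right) skew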

An outlier is a value that is very distant from other values in a dataset. Outliers may be true data, but they often are the result of errors such as mixing data of different units (kilograms vs. pounds).

Figure 10.4 shows the average in-state tuition for the 50 states plus DC. We see that the mean is $15,368 and the median is $14,615, and thus the distribution is skewed a bit to the right, or has a positive skew. The other two columns, Out of State Tuition and Net Cost, are also skewed slightly to the right, with positive skew. The first two columns of data have modal values of 14,470 and 15,638, respectively; each of those values occurs twice. Note that there is no mode for the third column because every value is unique; there are no repeat values.

Figure 10.4: State Tuitions Showing Measures of Central Tendency

Measures of Variability

Variability measures whether the data are tightly clustered or spread out. In other words, are the data points close to the center (the mean or median), or are they spread farther away from it? The most common measures used to summarize variability are:

  • Range — The difference between the largest and smallest values in a dataset, or the Excel formula of =MAX(range)-MIN(range).

  • Variance — Sum of squared deviations from the mean divided by n-1, where n is the number of data values (also called mean-squared-error). In Excel, =VAR.S(range).

  • Standard Deviation — The square root of the variance, or the Excel formula =STDEV.S(range).

  • Interquartile Range (IQR) – Difference between the 75th percentile and 25th percentile, using the Excel formula =QUARTILE.INC(range, 3) - QUARTILE.INC(range, 1).3

The simplest measure of variability is the range, i.e., the difference between the maximum value and the minimum value. This calculation works well for getting a first impression of the data. However, it is very sensitive to outliers. Outliers can be thought of as extreme or unexpected values that lie apart from the rest of the data. Since the range includes these extreme values, it may give a false impression of the true variability of the data.

The most common measure of data variability is the standard deviation. To calculate the standard deviation, we first calculate the variance. The variance is calculated by (1) squaring the difference between each data point and the mean, (2) summing all of these squared values, and (3) dividing by n-1. Note that squaring makes every difference positive, so the direction of each deviation from the mean no longer matters. The standard deviation is just the square root of the variance. Standard deviation is more commonly used than variance because it is easier to interpret: it is on the same scale as the original data. The variance and standard deviation, like the range, are susceptible to the influence of outliers.
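
The sketch below walks through this calculation on a few made-up values, mirroring the Excel formulas listed above, and checks the result against numpy's built-ins (ddof=1 gives the n-1 divisor).

    import numpy as np

    x = np.array([12.0, 15.0, 9.0, 22.0, 14.0, 18.0])   # illustrative data

    data_range = x.max() - x.min()                       # like =MAX(range)-MIN(range)

    deviations = x - x.mean()                            # (1) differences from the mean
    variance = (deviations ** 2).sum() / (len(x) - 1)    # (2) sum of squares, (3) divide by n-1
    std_dev = np.sqrt(variance)                          # like =STDEV.S(range)

    assert np.isclose(variance, x.var(ddof=1))           # same as numpy's sample variance
    assert np.isclose(std_dev, x.std(ddof=1))            # same as numpy's sample std deviation
    print(data_range, variance, std_dev)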

The previous figure, Figure 10.4, also shows values for the range and standard deviation. You can calculate the range by subtracting the tuition amount for Wyoming from the amount for Rhode Island. As can be seen, this is a large range of values. The standard deviation is also included in the figure. It shows that a large portion of the tuition amounts are within $6,080 of the mean value of $15,368.

Figure 10.5 illustrates the concept of standard deviation. The data points represented by the red area have a standard deviation of 10. The data points represented by the blue area have a standard deviation of 50. As we will see later, the standard deviation is also used to measure what percent of the data points are close to the mean or average.

Figure 10.5: The Standard Deviation Measures Data Dispersion

Image by JRBrown, public domain via Wikimedia Commons.

The other measure of variability that is often used is the interquartile range (IQR). This measure is based on median values and as such is not as sensitive to outlier data points. The basic idea is to divide the data points into four quartiles, each containing 25% of the data points regardless of the values of those points. Figure 10.6 illustrates this concept. The IQR is simply Q3 minus Q1 in the figure, which represents the middle 50% of the data points.

Figure 10.6: Interquartile Range

Image by Scribbr.

The process to manually calculate the IQR is the following:4

  1. Find the median of all data points. That is Q2 on the figure. If there is an odd number of data points, it is just the center point. For an even-numbered dataset, it is the average of the two middle data points.

  2. Find the median of the bottom half to get Q1. Again, for odd numbers it is the middle; for even datasets it is the average of the two middle data points.

  3. Find the median of the top half to get Q3.

  4. Subtract Q1 from Q3.
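
A short sketch of these manual steps in Python is shown below; it assumes the convention of excluding the overall median from the two halves when the dataset has an odd number of points (conventions differ, so results may vary slightly from other tools).

    def manual_iqr(values):
        data = sorted(values)
        n = len(data)

        def median(xs):
            mid = len(xs) // 2
            # odd count: the center point; even count: average of the two middle points
            return xs[mid] if len(xs) % 2 else (xs[mid - 1] + xs[mid]) / 2

        lower = data[: n // 2]                                     # bottom half
        upper = data[n // 2 + 1 :] if n % 2 else data[n // 2 :]    # top half
        return median(upper) - median(lower)                       # Q3 - Q1

    print(manual_iqr([1, 3, 4, 6, 7, 8, 10, 12, 14, 15, 18]))   # 11 points -> IQR of 10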

The QUARTILE function in Excel has been deprecated and replaced by two other functions, QUARTILE.INC and QUARTILE.EXC. The syntax for these functions is:

QUARTILE.INC(range, quart) — where quart is 0, 1, 2, 3, or 4.

A quart of 0 is equivalent to MIN() and a quart of 4 is equivalent to MAX(). Quart values of 1, 2, and 3 give the first, second, and third quartiles, respectively.

QUARTILE.EXC(range, quart) — where quart is 1, 2, or 3. 0 and 4 are not valid.

The difference between INC and EXC is whether the central median (Q2) is included in the calculations for Q1 and Q3: INC includes the median and EXC excludes it. Including the median draws Q1 and Q3 closer to the middle, so the INC version always yields a smaller IQR than the EXC version. Figure 10.7 illustrates QUARTILE.INC on the left and QUARTILE.EXC on the right; as can be seen, the IQR is substantially smaller for the INC version. Historically, the QUARTILE function that has been deprecated is equivalent to QUARTILE.INC.
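
For reference, numpy's percentile function offers several interpolation methods; its default 'linear' method is generally reported to match QUARTILE.INC and its 'weibull' method to match QUARTILE.EXC, though this mapping should be verified against your own Excel output. The sketch below compares the two on a small made-up dataset (numpy 1.22 or later for the method argument).

    import numpy as np

    data = np.array([1, 3, 4, 6, 7, 8, 10, 12, 14, 15, 18])   # 11 data points

    for method in ("linear", "weibull"):      # inclusive-style vs. exclusive-style quartiles
        q1, q3 = np.percentile(data, [25, 75], method=method)
        print(method, q1, q3, q3 - q1)

On this data the inclusive-style quartiles (5 and 13) sit closer to the median than the exclusive-style quartiles (4 and 14), so the inclusive IQR is smaller, as described above.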

Looking at the figure, we notice that QUARTILE.INC includes five data points within the IQR, and QUARTILE.EXC includes seven data points within the IQR. In this instance, with only eleven total data points, there is not an exact answer that includes precisely 50% of the data points for the IQR.

To calculate the IQR, we select the median value and the values out to the Q1 and Q3 values. So for the INC version, this selects two values on either side, resulting in a total of five data points. For the EXC version, it selects three data points on either side, which results in seven total data points for the IQR. There are several reasons for this discrepancy. First, the dataset is very small with discrete values. Second, for small datasets, whether the number of data points is odd or even will affect the answer substantially.

Figure 10.7: Quartile Calculations

Figure 10.8 illustrates how the IQR compares with a standard deviation for a normal distribution. For a normal distribution, the IQR covers an area of about 2/3 of a standard deviation on either side of the median. This is not necessarily true for a distribution that is not normal. As mentioned earlier, outlier data points can seriously affect the standard deviation but will not affect the IQR. In fact, one common method for identifying outliers is to flag data points that are more than 1.5 times the IQR below Q1 or more than 1.5 times the IQR above Q3. The figure illustrates these ranges on a normal distribution curve.
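
The 1.5 x IQR rule is easy to apply directly; the sketch below flags outliers in a small made-up dataset using numpy's default (inclusive-style) percentiles.

    import numpy as np

    data = np.array([12, 14, 15, 15, 16, 17, 18, 19, 21, 55])   # 55 looks extreme

    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr

    outliers = data[(data < lower_fence) | (data > upper_fence)]
    print(lower_fence, upper_fence, outliers)   # flags 55 as the only outlier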

Figure 10.4 contains values for the IQR and standard deviation for the college tuition costs. The IQR is close to 90% of the standard deviation for all three columns. For a normal distribution the IQR is roughly 1.35 times the standard deviation, so a ratio this low suggests that the standard deviation is being inflated by the more extreme tuition values, which the IQR ignores.

Figure 10.8: IQR versus Data Dispersion

Image by Jhguch, CC BY-SA 2.5 via Wikimedia Commons.

Excel QUARTILE versus Manual Calculation

It should be noted that when the dataset has an even number of entries, Excel calculates Q1 and Q3 by assigning percentile positions to the values and interpolating between them. Thus, the values for Q1 and Q3 that you get with manual calculations will not always be the same as those Excel calculates with QUARTILE(range, 1) and QUARTILE(range, 3). The actual number of data points within the IQR is the same with either calculation; just know that the precise values of Q1 and Q3 may differ from the manual results.5

Correlation

In exploratory data analysis, we often examine the correlation among variables. The most common correlation measure is the Pearson correlation coefficient, which measures the linear association between two variables. Variables A and B are positively correlated if high values of A tend to occur with high values of B, and negatively correlated if high values of A tend to occur with low values of B. The correlation coefficient is a standardized measure, ranging from -1 to +1, of the extent to which two numeric variables are associated with each other. We often create a table (called a correlation matrix) that shows the correlations between many variables.
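
For two variables, the coefficient is the sum of the products of the deviations from the two means divided by the product of the square roots of their sums of squared deviations. The sketch below computes it from that definition on made-up data and checks it against numpy's built-in correlation (analogous to Excel's CORREL).

    import numpy as np

    x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])   # illustrative data
    y = np.array([1.5, 3.7, 5.1, 8.2, 9.9])

    dx, dy = x - x.mean(), y - y.mean()
    r = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

    print(r)                                        # close to +1: strong positive association
    assert np.isclose(r, np.corrcoef(x, y)[0, 1])   # matches numpy's built-in result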

There are multiple ways to compute correlation using Excel. The CORREL function can be used to find the correlation between two variables. A complete correlation matrix, which computes the pairwise correlations among a set of variables, can be computed using the Excel Data Analysis ToolPak: from the Data ribbon, choose Data Analysis and select Correlation. The input to the Correlation tool is the data range, including the labels.6
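
Outside of Excel, a correlation matrix can be built in one line with pandas; the sketch below uses a few illustrative automobile-style columns, not the actual data behind Figure 10.9.

    import pandas as pd

    cars = pd.DataFrame({
        "mpg": [21.0, 22.8, 18.7, 14.3, 32.4, 15.2],   # illustrative values
        "hp":  [110, 93, 175, 245, 66, 180],
        "wt":  [2.62, 2.32, 3.44, 3.57, 2.20, 3.52],
    })

    corr_matrix = cars.corr()       # pairwise Pearson correlations
    print(corr_matrix.round(2))     # mpg correlates negatively with hp and wt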

Figure 10.9 is an example of a correlation matrix for various automobile makes and models. The input table is a range of about 35 different automobiles, each with 11 different properties. The data in the figure is also color formatted to show high positive correlation (dark green) to high negative correlation (dark red). Looking at the first column in the matrix, we can see several interesting correlations. MPG has a high negative correlation with cylinders, displacement, horsepower, and weight. It also has somewhat of a positive correlation with rear axle ratio, engine shape, and transmission. We can also see pairwise correlations that are close to zero, meaning no correlation. For example, the rear axle ratio has almost no correlation with the quarter-mile acceleration time.

Figure 10.9: Automobile Properties Correlation Matrix